Abstract

Cross-validation is a useful and generally ap- 
plicable technique often employed in machine 
learning, including decision tree induction. 
An important disadvantage of straightfor- 
ward implementation of the technique is its 
computational overhead. In this paper we 
show that, for decision trees, the computa- 
tional overhead of cross-validation can be re- 
duced significantly by integrating the cross- 
validation with the normal decision tree in- 
duction process. We discuss how existing de- 
cision tree algorithms can be adapted to this 
aim, and provide an analysis of the speedups 
these adaptations may yield. The analysis is 
supported by experimental results. 

1. Introduction 

Cross-validation is a generally applicable and very use- 
ful technique for many tasks often encountered in ma- 
chine learning, such as accuracy estimation, feature 
selection or parameter tuning. It consists of partition- 
ing a data set D into n subsets Di and then running 
a given algorithm n times, each time using a different 
training set D - Di and validating the results on Di.
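
As a minimal sketch of this definition, the following Python fragment partitions a toy data set into n subsets and derives the corresponding training sets. The helper names `make_folds` and `training_sets` and the round-robin assignment of examples to folds are assumptions made for illustration only.

```python
def make_folds(examples, n):
    """Partition `examples` into n disjoint subsets D_1..D_n (round-robin)."""
    return [examples[i::n] for i in range(n)]


def training_sets(folds):
    """For each i, build the training set T_i = D - D_i."""
    return [
        [e for j, fold in enumerate(folds) if j != i for e in fold]
        for i in range(len(folds))
    ]


D = list(range(12))          # a toy data set of 12 example ids
folds = make_folds(D, 3)     # D_1, D_2, D_3
trains = training_sets(folds)
for D_i, T_i in zip(folds, trains):
    # every example is either validated on or trained on, never both
    assert sorted(D_i + T_i) == D
```

Each training set contains (n - 1)/n of the examples, and every example appears in n - 1 of the training sets.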

Cross-validation is used within a wide range of ma- 
chine learning approaches, such as instance based 
learning, artificial neural networks, or decision tree 
induction. As an example of its use within decision 
tree induction, the CART system (Breiman et al., 1984) employs a tree
pruning method that is based on trading off predic- 
tive accuracy versus tree complexity; this trade-off is 
governed by a parameter that is optimized using cross- 
validation. 

While cross-validation has many advantages for cer- 
tain tasks, an often mentioned disadvantage is that 
it is computationally expensive. Indeed, n-fold cross- 
validation is typically implemented by running the 
same learning system n times, each time on a different
training set of size (n - 1)/n times the size of the
original data set. Because of this computational cost,
cross-validation is sometimes avoided, even when it is 
agreed that the method would be useful. 

It is clear, however, that when (for instance) a specific 
decision tree induction algorithm is run several times 
on highly similar datasets, there will be redundancy in 
the computations. E.g., when selecting the best test in 
a node of a tree, the test needs to be evaluated against 
each individual example in the training set. In an n-fold
cross-validation each example occurs n - 1 times as
a training example, which means that each test will be
evaluated on each training example n - 1 times. The
question naturally arises whether it would be possible 
to avoid such redundant computations, thereby speed- 
ing up the cross-validation process. In this text we 
provide an affirmative answer to this question. 

This paper is organised as follows. In Section 2 we 
focus on refinement of a single node of the tree; we 
identify the computations that are prone to the kind 
of redundancy mentioned above, indicate how this re- 
dundancy can be reduced, and analyse to what extent 
performance can thus be improved. In Section 3 we 
discuss the whole tree induction process, showing how 
our adapted node refinement algorithm fits in several 
tree induction algorithms. In Section 4 we present
experimental results for one of these algorithms that
support our complexity analysis and our main
claim that cross-validation can be integrated with
decision tree induction in such a way that it causes only
a small overhead. In Section 5 we briefly discuss to
what extent the results generalize to other machine 
learning techniques, and mention the limitations of our 
approach. In Section 6 we conclude. 

2. Efficient Cross-validation
2.1 Decision Tree Induction 

We describe decision tree induction algorithms only in
such detail as needed for the remainder of this text;
for more details see Quinlan (1993) or Breiman et al.
(1984).



function GROW_TREE(T: set of examples)
        returns decision tree:
    t* := optimal_test(T)
    P := partition induced on T by t*
    if stop_criterion(P)
    then return leaf(info(T))
    else
        for all Pj in P:
            trj := GROW_TREE(Pj)
        return node(t*, ∪j {(j, trj)})



Figure 1. A generic algorithm for top-down induction of 
decision trees. 

Decision trees are usually built top-down, using an
algorithm similar to the one shown in Figure 1. Basically,
given a data set, a node is created and a test is
selected for that node. A test is a function from the 
example space to some finite domain (e.g., the value of 
a discrete attribute, or the boolean result of a compar- 
ison between an attribute and some constant). Each 
test induces a partition of the data set (with each test 
result one subset is associated), and typically that test 
is selected for which the subsets of the partition are 
maximally homogeneous w.r.t. some target attribute 
(the "class", for classification trees). For each subset
of the partition, the procedure is repeated and the cre- 
ated nodes become children of the current node. The 
procedure stops when stop_criterion succeeds: this 
is typically the case when no good test can be found or 
when the data set is sufficiently homogeneous already. 
In that case the subset becomes a leaf of the tree and 
in this leaf information about the subset is stored (e.g., 
the majority class). The result of the initial call of the 
algorithm is the full decision tree. 
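
The generic procedure can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the implementation used in any of the cited systems: tests are restricted to discrete attributes, quality is measured by the number of examples outside the majority class, and the names `majority`, `impurity`, `split` and `grow_tree` are hypothetical.

```python
from collections import Counter, defaultdict


def majority(examples):
    """Leaf information: the majority class of the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def impurity(examples):
    """Number of examples outside the majority class (0 means pure)."""
    counts = Counter(label for _, label in examples)
    return len(examples) - max(counts.values())


def split(examples, attr):
    """Partition induced by testing the discrete attribute `attr`."""
    parts = defaultdict(list)
    for x, label in examples:
        parts[x[attr]].append((x, label))
    return parts


def grow_tree(examples, attrs):
    """Top-down induction; returns a leaf label or (attr, children)."""
    if impurity(examples) == 0 or not attrs:
        return majority(examples)
    # optimal test: the attribute whose partition is most homogeneous
    best = min(
        attrs,
        key=lambda a: sum(impurity(p) for p in split(examples, a).values()),
    )
    parts = split(examples, best)
    if len(parts) < 2:                      # stop criterion: no useful test
        return majority(examples)
    rest = [a for a in attrs if a != best]
    return (best, {v: grow_tree(p, rest) for v, p in parts.items()})


data = [({"shape": "circle", "size": "small"}, "pos"),
        ({"shape": "circle", "size": "large"}, "pos"),
        ({"shape": "square", "size": "small"}, "neg"),
        ({"shape": "square", "size": "large"}, "neg")]
tree = grow_tree(data, ["size", "shape"])
```

On this toy data the attribute `shape` separates the classes perfectly, so the result is a single test node with two leaves.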

The refinement of a single node (selecting the test and 
partitioning the data) can in more detail be described 
as follows: 

for all tests t that can be put in the node:
    for all examples e in the training set T:
        update_statistics(S[t], t(e), target(e))
    Q[t] := compute_quality(S[t])
t* := argmax_t Q[t]
partition T according to t*

The computation of the quality of a test t is split into
two phases here: one phase where statistics on t are
computed and stored into an array S[t], and a second
phase where the quality is computed from the statistics
(without looking back at the data set). For instance,
for classification trees, phase one could compute the
class distribution for each outcome of the test.^1 Quality
criteria such as information gain or gain ratio (Quinlan, 1993)
can easily be computed from this in phase two. For
regression, using variance as a quality criterion, a
similar two-phase process can be defined: the variance
can be computed from $\sum_i (y_i^2, y_i, 1)$, where the $y_i$
are the target values.
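
The two phases can be illustrated for classification with a small Python sketch. It assumes that S[t] is stored as a nested dictionary of counts and that information gain is the quality criterion. The function names mirror the pseudocode, but the data layout is an assumption made for illustration.

```python
import math
from collections import defaultdict


def update_statistics(S, outcome, target):
    """Phase one: count one example under (test outcome, class)."""
    S[outcome][target] += 1


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(S):
    """Phase two: quality from the counts alone, with no access to the data."""
    class_totals = defaultdict(int)
    for per_class in S.values():
        for cls, c in per_class.items():
            class_totals[cls] += c
    n = sum(class_totals.values())
    remainder = sum(
        sum(per_class.values()) / n * entropy(list(per_class.values()))
        for per_class in S.values()
    )
    return entropy(list(class_totals.values())) - remainder


# toy data: (outcome of a boolean test, class)
examples = [(True, "pos"), (True, "pos"), (False, "neg"), (False, "pos")]
S = defaultdict(lambda: defaultdict(int))
for outcome, target in examples:
    update_statistics(S, outcome, target)
gain = information_gain(S)
```

Only `update_statistics` touches the examples. `information_gain` works on the count matrix alone, which is the property the rest of the analysis relies on.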

2.2 Removing Redundancy 

2.2.1 Overlapping Data Sets 

Now assume that the node refinement process, as de- 
scribed above, is repeated several times, each time on a 
slightly different data set Ti (i.e., the Ti have many ex- 
amples in common). We assume here that the same set 
of tests is considered in all these nodes. Then instead 
of running the process n times, with n the number of 
data sets, the following algorithm can be used: 

for each test t that can be put in the node:
    for each example e in ∪i Ti:
        for each i such that e ∈ Ti:
            update_statistics(S[Ti, t], t(e), target(e))
    for each Ti:
        Q[Ti, t] := compute_quality(S[Ti, t])
for each Ti:
    ti* := argmax_t Q[Ti, t]
for each different test t* among the ti*:
    partition ∪i {Ti | ti* = t*} according to t*

This algorithm performs the same computations as 
running the original one once on each data set, except 
for two differences: 

• for each test t, each single example e is tested
only once instead of m(e) times, where m(e) is
the number of data sets the example occurs in.

• each single example e is sorted into a child node^2
f(e) times, instead of m(e) times, with f(e) the
number of different best tests for all the data
sets where the example occurred (obviously ∀e:
f(e) ≤ m(e)).

Note that in each node of the tree multiple tests (at 
most n), and correspondingly multiple sets of child 
nodes, may now be stored instead of just one. 
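
The statistics and selection steps of this algorithm can be sketched in Python as follows (the final partitioning step is omitted). Data sets are assumed to be sets of example identifiers, statistics are counts per (outcome, class) pair, and the quality function `purity` is a simple stand-in for criteria such as information gain. All names are hypothetical.

```python
from collections import defaultdict


def purity(S):
    """Quality: number of examples in the majority class of their outcome."""
    by_outcome = defaultdict(list)
    for (outcome, _cls), count in S.items():
        by_outcome[outcome].append(count)
    return sum(max(counts) for counts in by_outcome.values())


def refine_overlapping(datasets, tests, target, quality=purity):
    """Best test per data set, evaluating each test once per example."""
    membership = defaultdict(list)          # example -> data sets containing it
    for name, examples in datasets.items():
        for e in examples:
            membership[e].append(name)

    Q = defaultdict(dict)
    evaluations = 0
    for t_name, t in tests.items():
        S = defaultdict(lambda: defaultdict(int))
        for e, names in membership.items():
            outcome = t(e)                  # evaluated once, reused below
            evaluations += 1
            for name in names:
                S[name][(outcome, target(e))] += 1
        for name in datasets:
            Q[name][t_name] = quality(S[name])
    best = {name: max(Q[name], key=Q[name].get) for name in datasets}
    return best, evaluations


datasets = {"T1": {0, 1, 2, 3}, "T2": {2, 3, 4, 5}, "T3": {0, 1, 4, 5}}
tests = {"even": lambda e: e % 2 == 0, "small": lambda e: e < 3}
best, evaluations = refine_overlapping(datasets, tests, target=lambda e: e % 2)
```

In this toy setting the two tests are evaluated 12 times in total (6 distinct examples), whereas running the original procedure once per data set would evaluate them 24 times.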

^1 S[t] is then a matrix indexed on classes and results of
t, and update_statistics(S[t], t(e), class(e)) just increments
S[t][t(e), class(e)] by 1.

^2 Sorting examples into child nodes corresponds to
partitioning the data set.



2.2.2 Cross-validation 

For an n-fold cross-validation, each single example
occurs exactly n - 1 times as a training example. Hence,
the time needed to compute the statistics of all tests
is reduced by a factor n - 1 compared to running the
original algorithm n times. The time needed to sort
examples into child nodes is reduced by a factor n - 1 if the
same test is selected in all folds, otherwise a smaller 
reduction occurs. Besides this speedup there are no 
changes in the computational complexity of the algo- 
rithm (except for the extra computations involved in, 
e.g., selecting elements from a two-dimensional array 
instead of a one-dimensional array). 

Specifically for cross-validation, the algorithm can be
further improved if the employed statistics S, for any
data set D, can be computed from the corresponding
statistics of its subsets in a partition. This holds for
all statistics that are essentially sums (such as those
mentioned in Section 2.1), since in that case
$S(D) = \sum_i S(D_i)$. Such statistics could also be called additive.

In an n-fold cross-validation, the data set D is
partitioned into n sets Di, and the training sets Ti can
be defined as D - Di. It is then sufficient to
compute statistics just for the Di; those for the Ti can be
easily computed from this without further reference
to the data (first compute $S(D) = \sum_i S(D_i)$; then
$S(T_i) = S(D) - S(D_i)$). Since each example occurs in
exactly one of the Di, updating statistics has to be done
only N times instead of N(n - 1) times (with N the
number of examples).
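
A small Python sketch may make the use of additive statistics concrete. It uses the regression statistic of Section 2.1, the triple (sum of squares, sum, count), on hypothetical target values. The helper names are assumptions.

```python
def sums(ys):
    """Additive statistic for regression: (sum of y^2, sum of y, count)."""
    return (sum(y * y for y in ys), sum(ys), len(ys))


def add(s, t):
    return tuple(a + b for a, b in zip(s, t))


def sub(s, t):
    return tuple(a - b for a, b in zip(s, t))


def variance(s):
    """Population variance recovered from the sums alone."""
    sq, total, n = s
    return sq / n - (total / n) ** 2


folds = [[1.0, 2.0], [3.0, 5.0], [4.0, 9.0]]   # target values in D_1..D_3
S_fold = [sums(ys) for ys in folds]             # one pass over the data
S_D = (0.0, 0.0, 0)
for s in S_fold:
    S_D = add(S_D, s)                           # S(D) = sum_i S(D_i)
S_train = [sub(S_D, s) for s in S_fold]         # S(T_i) = S(D) - S(D_i)
variances = [variance(s) for s in S_train]
```

Each target value is read exactly once, when the fold statistics are built. The statistics of all training sets follow by subtraction.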

2.2.3 Cross-validation Combined with Actual 
Tree Induction 

In practice, cross-validation is usually performed in 
addition to building a tree from the whole data set: 
this tree is then considered to be the actual hypothesis 
proposed by the algorithm, and the cross-validation is 
done just to estimate the predictive accuracy of the 
hypothesis or for parameter tuning. The algorithm 
for efficient cross-validation can easily be extended so 
that it builds a tree from the whole data set in addition 
to the cross-validation trees (just add a virtual fold
where the whole data set is used as training set; note
that $S(T_0) = S(D)$). Adopting this change, we obtain
the algorithm in Figure 2. In the remainder of this text
we will refer to this algorithm as the parallel algorithm,
as opposed to the straightforward method of running
all cross-validation folds and the actual tree induction
serially (the serial algorithm).

At this point, we have discussed the major issues
related to the refinement of a single node. The next step
is to include this process into a full tree induction
algorithm. This will be discussed in the next section,
but first we take a look at the complexity of the node
refinement step.

{ D is the set of all examples relevant for this node,
  partitioned into n subsets Di, i = 1..n.
  T0 = D, and for i > 0: Ti = D - Di }

 1. for each test t that can be put in the node:
 2.     for each example e in D:
 3.         choose i such that e ∈ Di
 4.         update_statistics(S[Di, t], t(e), target(e))
 5.     compute S[Ti, t] (i = 0..n) from all S[Dj, t]
 6.     for each Ti:
 7.         Q[Ti, t] := compute_quality(S[Ti, t])
 8. for each Ti:
 9.     ti* := argmax_t Q[Ti, t]
10. for each different test t* among the ti*:
11.     partition ∪i {Ti | ti* = t*} according to t*

Figure 2. Performing cross-validation in parallel with
induction of the actual tree.
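
The node refinement algorithm of Figure 2 can be sketched in Python as follows. The sketch assumes class counts as statistics (stored in `collections.Counter` objects, which support addition and subtraction) and a simple `purity` quality function. It returns which training sets agree on which test instead of actually partitioning the data. All names are hypothetical.

```python
from collections import Counter, defaultdict


def purity(S):
    """Quality: number of examples in the majority class of their outcome."""
    top = defaultdict(int)
    for (outcome, _cls), count in S.items():
        top[outcome] = max(top[outcome], count)
    return sum(top.values())


def refine_node(folds, tests, target, quality=purity):
    """Return {test name: indices i of the training sets T_i that selected it}.

    folds are the D_i (i = 1..n); index 0 stands for the actual tree (T_0 = D).
    """
    n = len(folds)
    Q = [dict() for _ in range(n + 1)]
    for name, t in tests.items():
        # lines 2-4: each example updates the statistics of its own fold only
        S_fold = [Counter((t(e), target(e)) for e in D_i) for D_i in folds]
        # line 5: S[T_0] = S[D] = sum of fold statistics; S[T_i] = S[D] - S[D_i]
        S_D = sum(S_fold, Counter())
        S_train = [S_D] + [S_D - S_i for S_i in S_fold]
        # lines 6-7: quality per training set, from the statistics only
        for i, S in enumerate(S_train):
            Q[i][name] = quality(S)
    # lines 8-9: best test for every training set
    best = [max(Q[i], key=Q[i].get) for i in range(n + 1)]
    # lines 10-11: training sets that agree on a test share one partition
    groups = defaultdict(list)
    for i, name in enumerate(best):
        groups[name].append(i)
    return dict(groups)


folds = [[0, 1], [2, 3], [4, 5]]                 # D_1..D_3, examples are ints
tests = {"even": lambda e: e % 2 == 0, "small": lambda e: e < 3}
groups = refine_node(folds, tests, target=lambda e: e % 2)
```

Here all four training sets select the same test, so a single partition of the data serves the actual tree and all folds.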

2.3 Computational Complexity of Node 
Refinement 

Let $t_e$ be the time for extracting relevant information
from a single example (i.e., the example's target value
and test result) and updating the statistics matrix S
(in other words, executing line 4 in the algorithm in
Figure 2 once); $t_p$ the time needed to test an example
and sort it into the correct subset during partitioning;
N the number of examples in the data set, n the
number of folds, and a the number of tests. Then we
obtain the following times for refining a single node
(the $c_i$ denote terms constant in N):

• when building one tree from the full data set:
$T_{\mathrm{actual}} = a N t_e + N t_p + c_1 = N (a t_e + t_p) + c_1$

• when performing cross-validation serially:
$T_{\mathrm{1\,fold}} = \frac{n-1}{n} N (a t_e + t_p) + c_2$
$T_{n\,\mathrm{folds}} = (n-1) N (a t_e + t_p) + c_3$

• when serially building the actual tree and performing
a cross-validation:
$T_{\mathrm{serial}} = T_{\mathrm{actual}} + T_{n\,\mathrm{folds}} = n N (a t_e + t_p) + c_4$

• when using the parallel algorithm, worst case (all
folds select different tests):
$T_{\mathrm{parallel}} = a N t_e + n N t_p + c_5 = N (a t_e + n t_p) + c_5$

• when using the parallel algorithm, best case (all
folds select the same test):
$T_{\mathrm{parallel}} = N (a t_e + t_p) + c_6$



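Our analysis gives rise to approximate upper bounds
on the speedup factors that can be achieved. Assuming
large N so that the $c_i$ terms can be ignored (hence
"approximate"), for the worst case we get

$$\frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} = n \, \frac{a t_e + t_p}{a t_e + n t_p} < n$$

and

$$\frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} = \frac{n a t_e + n t_p}{a t_e + n t_p} < \frac{n a t_e + n t_p}{n t_p} = 1 + \frac{a t_e}{t_p}.$$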

Hence the worst case speedup factor is bounded by
$\min(n, 1 + a t_e / t_p)$. It will approximate n when a) N
becomes large and b) $t_p$ is small compared to $a t_e$. In
the best case, where the same test is selected for all
folds, we just get $T_{\mathrm{serial}}/T_{\mathrm{parallel}} < n$: the speedup
factor approaches n as soon as N becomes large. Another
way to look at this is to observe that $T_{\mathrm{parallel}}/T_{\mathrm{actual}}$
approaches one; in other words, for large N and a stable
problem (where small perturbations in the data do
not lead to different tests being selected) the overhead
caused by performing cross-validation becomes
negligible.
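
To get a feeling for these bounds, the dominant terms of the formulas of this section can be evaluated numerically. The parameter values in the following Python sketch are hypothetical and chosen only for illustration, and the constant terms are dropped.

```python
def node_times(N, n, a, te, tp):
    """Dominant terms of the node refinement times (constant terms dropped)."""
    actual = N * (a * te + tp)
    serial = n * N * (a * te + tp)             # actual tree plus n folds
    parallel_worst = N * (a * te + n * tp)     # every fold picks its own test
    parallel_best = N * (a * te + tp)          # all folds pick the same test
    return actual, serial, parallel_worst, parallel_best


# hypothetical values: 1000 examples, 10 folds, 50 tests, te equal to tp
actual, serial, worst, best = node_times(N=1000, n=10, a=50, te=1.0, tp=1.0)
worst_speedup = serial / worst       # 8.5
best_speedup = serial / best         # 10.0
bound = min(10, 1 + 50 * 1.0 / 1.0)  # min(n, 1 + a*te/tp) = 10
```

With 50 tests per node the worst-case speedup (8.5) is already close to n = 10, because the partitioning cost is small compared to the cost of evaluating all tests.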

3. An Algorithm for Building Trees in 
Parallel 

We now describe how the above algorithms for node 
refinement fit in decision tree induction algorithms. 
First we describe the data structures, which are more 
complicated than when growing individual trees. Next 
we discuss several decision tree induction techniques 
and show how they can exploit the above algorithms. 

3.1 Data Structures 

Since the parallel cross-validation algorithm builds 
multiple trees at the same time, we need a data struc- 
ture to store all these trees together. We refer to this 
structure as a "forest", although this might be some- 
what misleading as the trees are not disjoint, but may 
share some parts. 

An example of a forest is shown in Figure 3. In
this figure two kinds of internal nodes are represented. 
The small squares represent bifurcation points, points 
where the trees of different folds start to differ because 
different tests were selected. The larger rectangles rep- 
resent tests that partition the relevant data set. The 
way in which the trees in the forest split the data sets 
is illustrated by means of an example data set of 12 
elements on which a three-fold cross-validation is per- 
formed. 

Figure 3. An example forest for a 3-fold cross-validation.



Note that the memory consumption of a forest is 
(roughly) at most n + 1 times that of a single tree 
(this happens when at the root different tests are ob- 
tained for all n folds plus the actual tree), which in 
practice is not problematic since n usually is small. 

When in the following we refer to nodes in the forest, 
we always refer to the test nodes, making abstraction 
of bifurcation points. E.g., in Figure I the root node 
has five children, three of which are leaves. 
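
One possible in-memory representation of such a forest is sketched below in Python. The class names and fields are assumptions made for illustration (this is not the representation used in Tilde), and the small forest that is built is hypothetical rather than a transcription of Figure 3.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Leaf:
    info: str                       # e.g. the majority class
    folds: List[int]                # trees (0 = actual tree) sharing this leaf


@dataclass
class TestNode:
    test: str
    folds: List[int]                # trees that selected this test here
    children: Dict[str, "Bifurcation"] = field(default_factory=dict)


@dataclass
class Bifurcation:
    """Point where trees may diverge: one alternative per distinct best test."""
    alternatives: List[Union[TestNode, Leaf]] = field(default_factory=list)


def count_test_nodes(b: Bifurcation) -> int:
    """Number of test nodes stored in the forest below bifurcation point b."""
    total = 0
    for alt in b.alternatives:
        if isinstance(alt, TestNode):
            total += 1 + sum(count_test_nodes(c) for c in alt.children.values())
    return total


# all four trees test A at the root; below its "no" branch, trees 0-2 choose B
# while tree 3 chooses C, so the forest bifurcates there
root = Bifurcation([TestNode("A", [0, 1, 2, 3], {
    "yes": Bifurcation([Leaf("pos", [0, 1, 2, 3])]),
    "no": Bifurcation([
        TestNode("B", [0, 1, 2], {"yes": Bifurcation([Leaf("pos", [0, 1, 2])]),
                                  "no": Bifurcation([Leaf("neg", [0, 1, 2])])}),
        TestNode("C", [3], {"yes": Bifurcation([Leaf("neg", [3])]),
                            "no": Bifurcation([Leaf("pos", [3])])}),
    ]),
})])
```

The shared root is stored once for all four trees. Only the part below the bifurcation is duplicated, which is why the memory consumption stays well below n + 1 separate trees for stable problems.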

3.2 Tree Induction Algorithms 

3.2.1 Depth First Tree Induction 

Probably the best known approach to decision tree
induction is Quinlan's (1986) ID3 algorithm, later
developed into C4.5 (Quinlan, 1993). ID3 basically follows the
depth-first approach of Figure 1.

The simplest way to adapt an ID3-like algorithm to
perform cross-validation in parallel with the actual
tree building, is to make it use the node refinement
algorithm of Figure 2 and call the algorithm recursively
for each child node created. Note that the number of
such child nodes is now $\sum_{i=1}^{f} r_i$, with f the number of
different tests selected as best test in some fold and
$r_i$ the number of possible results of the i-th test.

In this way, the above mentioned speedup is ob- 
tained as long as the same test is chosen in all cross- 
validations and in the actual tree. The more differ- 
ent tests are selected, the less speedup is achieved; 
and when in each fold a different test is selected, the 
speedup factor goes to 1 (all folds are handled sepa- 
rately). 

To see how this process influences the total forest
induction time, let us define $t_r(i)$ as the average time
that is needed to refine all the nodes of a single tree
on level i for a data set of size |D|, and f(i) as the
average number of different tests selected on level i of
the forest (averaged over all nodes on that level of the
forest). The computational complexity of the whole
forest building process can then be approximated as

$T_{\mathrm{parallel}} = t_r(1) + f(1) t_r(2) + f(2) t_r(3) + \cdots$

for the parallel version, and, assuming that refinement
time is linear in the number of examples in nodes that
are to be refined,^3

$T_{\mathrm{serial}} = n t_r(1) + n t_r(2) + n t_r(3) + \cdots$

for the serial version (we obtain $n t_r(i)$ and not
$(n + 1) t_r(i)$ because the n folds have size $\frac{n-1}{n}|D|$).

Thus the total speedup will be between 1 and n, and
will be higher for stable problems (low f(i)) than for
unstable problems (most f(i) close to n + 1).
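
The two approximations can be compared numerically with a short Python sketch. The per-level refinement times and the f values are hypothetical, and the function name `forest_times` is an assumption.

```python
def forest_times(tr, f, n):
    """Approximate total induction times from per-level quantities.

    tr[k]: time t_r(k+1) to refine level k+1 of a single tree on the full data
    f[k]:  average number f(k+1) of different tests selected on level k+1
    n:     number of folds
    """
    # level 1 is refined once; level k+2 once per distinct test on level k+1
    parallel = tr[0] + sum(f[k] * tr[k + 1] for k in range(len(tr) - 1))
    serial = n * sum(tr)
    return parallel, serial


tr = [10.0, 8.0, 4.0]                          # hypothetical per-level times
stable = forest_times(tr, f=[1, 1], n=10)      # same test chosen everywhere
unstable = forest_times(tr, f=[6, 11], n=10)   # trees diverge quickly
```

For the stable case the speedup is 220 / 22 = 10, the number of folds. For the unstable case it drops to 220 / 102, a little over 2.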

3.2.2 Level-wise Tree Induction 

Most decision tree induction algorithms assume that 
all data reside in main memory. When inducing a tree 
from a large database, this may not be realistic: data 
have to be loaded from disk into main memory when 
needed, and then for efficiency reasons it is important 
to minimize the number of times each example needs 
to be loaded (i.e., minimize disk access). To that aim 
alternative tree induction algorithms have been
proposed that build the tree one whole level at a
time, where for each level one pass through the data is 
required. The idea is to go over the data and for each 
example, update statistics for all possible tests in the 
node (of the currently lowest level of the tree) where 
the example belongs. For each node the best test is 
then selected from these statistics without more ac- 
cess to the data. 

Since in these approaches, too, the computation of the 
quality of tests is split up into two phases (comput- 
ing statistics from data, computing test quality from 
statistics), it is easy to see how such level-wise
algorithms can be adapted. When processing one example,
instead of looking up the single node in the tree where
the example belongs, one should look up all the nodes
in the forest where the example belongs (for an example
not yet in a leaf this is at least one node and at
most n - 1 nodes, with n the number of folds) and
update the statistics in all these nodes. 

When data reside on disk, the number of examples
is typically large and both $t_e$ and $t_p$ are large (due
to external data access). The constant terms $c_i$ then
become negligible very quickly, and the speedup factor
can approach n if $a \gg n t_p / t_e$. Assuming that $t_p$ and $t_e$
are comparable, this will be true as soon as $a \gg n$,
which in practice often holds.

^3 From this it follows that in one fold of n-fold
cross-validation the actual refinement time for level i is $\frac{n-1}{n} t_r(i)$.

3.3 Further optimisations 

As soon as different tests are selected for different 
folds, the forest induction process bifurcates in the 
sense that from that point onwards different trees 
in the forest will be handled independently. A fur- 
ther optimisation that comes to mind, is removing re- 
dundancy among computations in these independently 
handled trees as well. 

Referring to Figure 3, among the different branches
created by a bifurcation point (square node) there may
still be some overlap with respect to the tests that will
be considered in the child nodes, as well as the relevant
examples. For instance, in the lower right of the forest
in Figure 3, in the children of the "test B" node one
needs to consider all tests except A and B, and in the
children of the "test C" node one needs to consider all
tests except A and C. Since the relevant example set
for fold f3 at that point ({2,3,5}) overlaps with that of
folds f1 and f2 ({2,3,5,10,12}), all tests besides A, B
and C will give rise to some redundant computations.

Removing this redundancy as well would give rise to 
a more thorough redesign of the forest induction pro- 
cess; it seems that for best results the depth-first tree 
induction method should be abandoned, and a level- 
wise method adopted instead. Here we will not discuss 
this optimisation any further but focus on the above 
described algorithm, which is simple and compatible 
with both tree induction approaches and can easily be 
integrated in existing tree induction systems. 

4. Experimental Evaluation 
4.1 Implementation 

We implemented the algorithm of Figure 2 as a module of Tilde,
an ILP (inductive logic programming) system
that induces first order decision trees; briefly, these
are decision trees where a test in a node is a first order
literal or conjunction, and a path from root to leaf can
be interpreted as a Horn clause. Literals belonging to
different nodes in such a path may share variables.

A typical property of ILP systems in general, and
Tilde is no exception, is that because tests are first
order conjunctions, both the number of tests and the
time needed to perform a test may be large. This
translates to large a, $t_e$ and $t_p$ values in our complexity
analysis.

For these experiments we used the version of Tilde as
implemented within the ACE data mining tool;^4
this version is a depth-first ID3-like algorithm that
keeps all data in main memory.

With these experiments we aim at a better un- 
derstanding of the behaviour of the parallel cross- 
validation process. We measure how much speedup 
the parallel procedure yields, compared to the serial 
one; how the overhead of the parallel procedure varies 
with the number of folds; and how much time is spent 
by both procedures on different levels of the tree. 

The parallel and serial procedures make use of exactly 
the same implementation of Tilde except for the dif- 
ferences between parallel and serial execution as de- 
scribed in this text. The different procedures are com- 
pared pairwise for the following data sets: 

• SB (Simple Bongard) and CB (Complex Bongard):
several artificially generated sets of so-called
"Bongard" problems (pictures are classified
according to simple geometric patterns). SB
contains 1453 examples with a simple underlying
theory, CB 1521 examples with a more complex
theory.



• Muta: the Mutagenesis data set, an ILP
benchmark (230 examples)



• ASM: a subset of 999 examples of the so-called 
"Adaptive Systems Management" data set, kindly 
provided to us by Perot Systems Nederland. 

• Mach: "Machines", a tiny data set (15 examples)
described in ||l| 

The number of tests in each node varied from 3 to a few 
hundred (as tests are first-order clauses, their number 
may vary greatly even among nodes of the same tree). 

4.3 Results 

Table 1 compares the actual tree building time Ta, the
time for serially performing 10-fold cross-validation in
addition to the actual tree building Ts, and the time
needed by the parallel algorithm Tp. In addition to
these, the speedup factor S = Ts/Tp is shown, as
well as the overhead caused by performing the
cross-validation (Os = 100(Ts/Ta - 1)%, similarly for Op).
Os and Op are plotted graphically in Figure 4.

^4 ACE is available for academic purposes upon request.

Table 1. Timings of parallel and serial execution on various
data sets (in seconds).

Figure 4. Ts and Tp relative to Ta. The part above the
horizontal line is the overhead Os respectively Op.
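
The derived quantities follow directly from the three timings, as the following Python sketch shows. The timings used here are hypothetical and are not taken from Table 1. The function names are assumptions.

```python
def overhead_percent(t, t_actual):
    """Overhead O = 100 * (T / Ta - 1) %, relative to building the actual tree."""
    return 100.0 * (t / t_actual - 1.0)


def speedup(t_serial, t_parallel):
    """Speedup factor S = Ts / Tp."""
    return t_serial / t_parallel


# hypothetical timings in seconds, not taken from Table 1
Ta, Ts, Tp = 2.0, 20.0, 2.5
Os = overhead_percent(Ts, Ta)   # 900.0
Op = overhead_percent(Tp, Ta)   # 25.0
S = speedup(Ts, Tp)             # 8.0
```

An overhead of 0% would mean that cross-validation comes for free. For the serial procedure with 10 folds the overhead is expected to be around 900%.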

The lowest overhead is achieved for Simple Bongard, 
which has a relatively large number of examples and a 
simple theory. The simplicity of the true theory causes 
the induced trees to be exactly the same in most folds, 
yielding little bifurcation. For Complex Bongard, the 
effect of bifurcation is more prominent. For ASM, a 
real-world data set for which a perfect theory may not 
exist, the overhead of cross-validation is relatively high 
(but still better than for the serial algorithm). For Ma- 
chines, the overhead is relatively large but still smaller 
than for the serial algorithm; i.e., even for small ex- 
ample sets the parallel algorithm yields a speedup. 

For Mutagenesis we obtained poorer results. Two
factors turned out to be responsible for this: instability 
of the trees, but also high variance in the complexity 
of testing examples. The latter is due to the fact that 
first-order queries have exponential worst-case com- 
plexity; most of them are reasonably fast, but a very 
few of them may dominate the others, time-wise. Such 
behaviour typically occurs at lower levels of the tree, 
as will be confirmed when we look at Figure 6.

Figure 5 shows how cross-validation overhead varies
with the number of folds for the CB and ASM data 
sets. The result for CB confirms our expectation that 
n has a small influence on the total time, but for ASM 
the overhead increases with increasing n. 

The latter result can be understood by looking at the
graphs in Figure 6, where the total time spent on each
level of the tree by the parallel and the serial
procedure is plotted, together with the f(i - 1) values as
defined previously. The graphs clearly show that when
f goes up, the per-level speedup factor is reduced. For
CB, this happens at a point where the total refinement
time is already small, so it does not influence
the overall speedup factor much; but for ASM and
Muta f increases almost immediately. Note that in
the part where f is high, many folds are handled
independently and cross-validation becomes linear in n,
which explains the increase for the ASM data in
Figure 5. It is also clear in Figure 6 how the time spent on
some lower levels suddenly goes up; this is the effect
of stumbling upon some very complex tests.

5. Applicability and Limitations 

Although we have studied efficient cross-validation in 
the context of decision trees, the principles explained 
here are also applicable outside this domain. For in- 
stance, rule set induction systems typically build a rule 
by consecutively adding a "best" condition to it until 
no further improvement occurs. Similar to our forest- 
building algorithm, cross-validation of such rules could 
be performed in parallel with the construction of the 
actual rule set, avoiding redundant computations. 

It is less clear, however, how the technique could be 
used with models that contain only continuous param- 
eters, such as neural networks. We obtain the greatest 
speedups for stable trees, where the same test is chosen 
in different folds. With continuous models, no
computations will ever be exactly the same, hence removal
of exactly redundant computations as explained here
will in general not be possible.

Figure 6. Total refinement time per level.

Also within decision tree induction a number of limi- 
tations exist. A first one is related to the use of contin- 
uous parameters in the tree. Decision tree induction 
systems often construct inequality tests for continuous 
attributes (e.g., A < 5.3) where the constant is gener- 
ated from the currently relevant data. Even for stable 
problems where the same test is usually selected for 
different folds, there may be small differences in the 
constants that make the tests look different. Solving 
this problem requires extra optimisations. 
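
The effect can be illustrated with a small Python sketch. It assumes, purely for illustration, that candidate constants are generated as midpoints between consecutive attribute values in the relevant training set. Actual systems may generate them differently, and the data values are hypothetical.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous attribute."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]


D = [1.0, 2.0, 4.0, 7.0, 8.0, 9.0]                     # attribute values in D
folds = [D[i::3] for i in range(3)]                    # D_1..D_3
trains = [[v for v in D if v not in fold] for fold in folds]
per_fold = [candidate_thresholds(t) for t in trains]   # constants per T_i
shared = set(per_fold[0]).intersection(*per_fold[1:])
```

In this small example no constant is generated in all three folds, so even tests on the same attribute would be treated as different tests by an algorithm that compares tests syntactically.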

A second limitation is that the proposed techniques 
concern the tree building phase only. This phase is 
typically followed by tree post-pruning, and may be 
preceded by data pre-processing, such as discretization
of attributes. While these other phases usually
take much less time than the tree building phase, when 
they are not negligible and n is large they may become 
the bottleneck, limiting the usefulness of our approach 
(unless optimisations similar to the ones discussed here 
are also possible in these phases). 

6. Conclusions 

We have shown that in the context of decision tree in- 
duction the benefits of cross-validation are available for 
a relatively low overhead, if the cross-validation is care- 
fully integrated with the normal tree building process. 



Comparing experimental results with an analytical es- 
timate of this overhead, we have identified a number 
of disturbing factors, such as variance in test complex- 
ity (which causes variance in the overhead) and tree 
instability (which causes the overhead to increase on 
average). These factors increase the overhead, but in 
all cases it was still smaller than for the serial cross- 
validation procedure, and in the best cases there was 
only a small overhead over the normal tree induction 
process. 

The ideas underlying our approach are also applicable 
outside the decision tree context, e.g., for rule induc- 
tion, but not immediately for induction of models that 
have only continuous parameters. 

Possible further improvements to the technique include 
specific adaptations for handling tests with continu- 
ous values. Also, the algorithms we have discussed 
are fairly simple versions; the SPRINT system [11], for
instance, is much more sophisticated with respect to
the statistics it keeps, and adaptations to the system 
along the lines of this paper would be correspondingly 
complex to implement. 

Related work includes that of Moore and Lee (1994),
who discuss efficient cross-validation in the context of 
model selection. Their approach differs substantially 
from ours in that they obtain efficiency by quickly 
abandoning models that after seeing some examples 
have low probability of ever becoming the best model; 
i.e., they save on the number of cases a model is evalu- 
ated on during cross-validation, whereas our work fo- 
cuses on removing redundancy in the model building 
process itself. 

Blockeel et al. (2000) discuss a technique similar to
the one described here. The main difference is in
the kind of redundancies that are removed; here the
redundancies arise from running the same test in
different folds of a cross-validation, whereas in Blockeel et
al. (2000) they are caused by similarities in different
tests (the tests being first-order conjunctions, which
might be similar up to one literal). Both approaches
can easily be combined, and such work is in progress.